Redundancy Elimination Within Large Collections of Files

Authors

  • Purushottam Kulkarni
  • Fred Douglis
  • Jason D. LaVoie
  • John M. Tracey
Abstract

Ongoing advancements in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL generally encodes more compactly than compression alone (by up to a factor of 14) and than a combination of compression and duplicate suppression (by up to a factor of 6.7). REBL also encodes similarly to a technique based on delta-encoding, reducing overall space significantly in one case. Furthermore, REBL uses super-fingerprints, a technique that reduces the data needed to identify similar blocks while dramatically reducing the computational requirements of matching the blocks: it turns comparisons into hash table lookups. As a result, using super-fingerprints to avoid enumerating matching data objects decreases computation in the resemblance detection phase of REBL by up to a couple of orders of magnitude.
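The idea of turning block comparisons into hash table lookups can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how features are computed, so the sketch assumes min-wise hashing of sliding byte windows with salted SHA-1, and the names `features`, `super_fingerprints`, and `SimilarityIndex`, as well as all parameter values, are hypothetical.

```python
import hashlib
from collections import defaultdict


def features(block: bytes, num_features: int = 8, window: int = 4) -> list[int]:
    """Assumed feature scheme: for each salted hash function, keep the
    minimum hash value over all sliding windows of the block."""
    shingles = {block[i:i + window]
                for i in range(max(1, len(block) - window + 1))}
    feats = []
    for salt in range(num_features):
        feats.append(min(
            int.from_bytes(hashlib.sha1(bytes([salt]) + s).digest()[:8], "big")
            for s in shingles
        ))
    return feats


def super_fingerprints(feats: list[int], group_size: int = 4) -> list[tuple[int, bytes]]:
    """Hash each group of features into one super-fingerprint.
    Two blocks share a super-fingerprint only if every feature in
    that group matches (barring hash collisions)."""
    sfps = []
    for start in range(0, len(feats), group_size):
        group = feats[start:start + group_size]
        data = b"".join(f.to_bytes(8, "big") for f in group)
        # Tag with the group index so different groups never collide.
        sfps.append((start // group_size, hashlib.sha1(data).digest()[:8]))
    return sfps


class SimilarityIndex:
    """Hash table from super-fingerprint to the ids of blocks that have it."""

    def __init__(self) -> None:
        self.table: dict[tuple[int, bytes], list[str]] = defaultdict(list)

    def add(self, block_id: str, block: bytes) -> None:
        for sfp in super_fingerprints(features(block)):
            self.table[sfp].append(block_id)

    def candidates(self, block: bytes) -> set[str]:
        """Return ids of indexed blocks sharing at least one super-fingerprint.
        Cost is a few table lookups, independent of how many blocks are indexed."""
        found: set[str] = set()
        for sfp in super_fingerprints(features(block)):
            found.update(self.table.get(sfp, ()))
        return found


index = SimilarityIndex()
index.add("block-a", b"abcdefgh" * 32)
same = index.candidates(b"abcdefgh" * 32)
other = index.candidates(b"0123456789" * 20)
```

In this sketch, a block that shares a super-fingerprint with an indexed block becomes a candidate for delta-encoding against it, without any pairwise comparison of feature lists. Larger groups make matches stricter; more groups make it likelier that a similar block is found.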

Similar resources

Load Redundancy Removal through Instruction Reuse

Avinash Sodani, Gurindar S. Sohi, Dynamic instruction reuse, Proceedings of the LaVoie, John M. Tracey, Redundancy elimination within large collections of files, It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. Load Value Approximation. 00146 00147 // It is not safe to merge these two switch instructions if they have a...

Information Curators in an Enterprise File-Sharing Service

We report on a social-software file-sharing service within a large company. User-created collections of files were associated with increased usage of the uploaded files, especially the sharing of files from one employee to another. Employees innovated in the use of the collections features as “information curators,” an emergent lead-user role in which one employee creates named, described colle...

A Dynamic Deduplication Approach for Big Data Storage

As data increases every day, managing storage devices for this explosive growth of digital data is a very challenging task. Data reduction has become a crucial problem. Deduplication plays a vital role in removing redundancy in large-scale cluster computing storage. As a result, deduplication provides better storage utilization by eliminating redundant copies of data and savi...

Datasets for the Grid

Introduction The grid provides a framework for managing and processing very large collections of data. Files are a very important unit for data handling but are not convenient for expressing a collective data view because large data collections must span a large number of files. The large data volume also makes it desirable to express some data collections as collections or subsets of existing ...

Publication date: 2004